Memory access register file

ABSTRACT

The general idea according to the invention is to introduce a special-purpose register file (34) adapted for holding memory address calculation information received from memory (50, 70) and to provide one or more dedicated interfaces (73, 74) for allowing efficient transfer of memory address calculation information in relation to the special-purpose register file. The special-purpose register file (34) is preferably connected to at least one functional processor unit (42), which is operable for determining a memory address based on memory address calculation information received from the special-purpose register file (34). Once the memory address has been determined, the corresponding memory access can be effectuated.

TECHNICAL FIELD OF THE INVENTION

The present invention generally relates to processor technology and computer systems, and more particularly to a hardware design for handling memory address calculation information in such systems.

BACKGROUND OF THE INVENTION

With the ever-increasing demand for faster and more effective computer systems naturally comes the need for faster and more sophisticated electronic components. The computer industry has been extremely successful in developing new and faster processors. The processing speed of state-of-the-art processors has increased at a spectacular rate over the past decades. However, one of the major bottlenecks in computer systems is the access to the memory system, and the handling of memory address calculation information. This problem is particularly pronounced in applications with implicit memory address information, requiring sequenced memory address calculation. A sequenced memory address calculation based on implicit memory address information generally requires several clock cycles before the actual data corresponding to the memory address can be read.

In systems using dynamic linking, for example systems with dynamically linked code that can be reconfigured during operation, memory addresses are generally determined by means of several table look-ups in different tables. This typically means that the initial memory address calculation information may contain a pointer to a first look-up table, and that table holds a pointer to another table, which in turn holds a pointer to a further table and so on until the target address can be retrieved from a final table. With several look-up tables, a lot of memory address calculation information must be read and processed before the target address can be retrieved and the corresponding data accessed.
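The chain of table look-ups described above can be illustrated with a minimal Python sketch. The memory image, the addresses and the function name `resolve_target` are hypothetical and chosen only for illustration; the point is that every level of indirection costs one additional memory read before the actual data can be accessed.

```python
def resolve_target(memory, start, levels):
    """Follow `levels` pointers starting at `start`.

    Returns the target address and the number of memory reads spent
    on address calculation information alone.
    """
    address = start
    reads = 0
    for _ in range(levels):
        address = memory[address]  # each look-up table entry holds the next pointer
        reads += 1
    return address, reads


# Hypothetical memory image: three look-up tables followed by the data itself.
memory = {0x10: 0x20, 0x20: 0x30, 0x30: 0x40, 0x40: "payload"}

target, reads = resolve_target(memory, 0x10, levels=3)
data = memory[target]  # the actual data access comes only after all look-ups
```

In this sketch three reads are spent on pointers before the single read that delivers the data.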

Another situation where the handling of memory address calculation information really becomes a major bottleneck is when a CISC (Complex Instruction Set Computer) instruction set is emulated on a RISC (Reduced Instruction Set Computer) or VLIW (Very Long Instruction Word) processor. In such a case, the complex CISC memory operations cannot be mapped directly to a corresponding RISC instruction or to an operation in a VLIW instruction. Instead, each complex memory operation is mapped to a sequence of instructions that performs memory address calculations, memory mapping and so forth. Several problems arise with the emulation, including low performance due to a high instruction count, high register pressure since many registers are used for storing temporary results, and additional pressure on load/store units in the processor for handling address translation table lookups.

A standard solution to the problem of handling implicit memory address information, in particular during instruction emulation, is to rely as much as possible on software optimizations for reducing the overhead caused by the emulation. But software solutions can only reduce the performance penalty, not solve it. There will consequently still be a large amount of memory operations to be performed. The many memory operations may be performed either serially or handled in parallel with other instructions by making the instruction wider. However, serial performance requires more clock cycles, whereas a wider instruction will give a high pressure on the register files, requiring more register ports and more execution units. Parallel performance thus gives a larger and more complex processor design but also a lower effective clock frequency.

An alternative solution is to devise a special-purpose instruction set in the target architecture. This instruction set can be provided with operations that perform the same complex address calculations that are performed by the emulated instruction set. Since the complex address calculations are intact, there is less opportunity for optimizations when mapping the memory access instructions into a special-purpose native instruction. Although the number of instructions required for emulation of complex addressing modes can be reduced, this approach thus gives less flexibility.

Even with special-purpose instructions, there will normally be extra loads for loading the implicit memory access information. Emulators usually keep these in memory and cache them as any other data. This gives additional memory reads for each memory access in the emulated instruction stream, and thus requires a larger data cache with more associativity. This is generally not an option in modern processors that are optimized for highest possible clock frequency. In addition, implicit memory access information typically does not fit directly in normal-sized words. The common way of handling this problem is to use several instructions for reading the information from memory, which in effect means that additional instructions have to be executed.

U.S. Pat. No. 5,696,957 describes an integrated unit adapted for executing a plurality of programs, where data stored in a register set must be replaced each time a program is changed. The integrated unit has a central processing unit (CPU) for executing the programs and a register set for storing data required for executing a program in the CPU. In addition, a register-file RAM is coupled to the CPU for storing at least the same data as that stored in the register set. The stored data of the register-file RAM may then be supplied to the register set when a program is replaced.

SUMMARY OF THE INVENTION

The present invention overcomes these and other drawbacks of the prior art arrangements.

It is a general object of the present invention to improve the performance of a computer system.

It is another object of the invention to increase the effective memory access bandwidth in the system.

Yet another object of the invention is to provide an efficient memory access system.

Still another object of the invention is to provide a hardware design for effectively handling memory address calculation information in a computer system.

It is also an object of the invention to minimize interconnect delays in silicon implementations.

These and other objects are met by the invention as defined by the accompanying patent claims.

The general idea according to the invention is to introduce a special-purpose register file adapted for holding memory address calculation information received from memory and to provide one or more dedicated interfaces for allowing efficient transfer of memory address calculation information in relation to the special-purpose register file. The special-purpose register file is preferably connected to at least one functional processor unit, which is operable for determining a memory address based on memory address calculation information received from the special-purpose register file. Once the memory address has been determined, the corresponding memory access can be effectuated.

For efficient loading of memory address calculation information, such as implicit memory access information, into the special-purpose register file, the special register file is preferably provided with a dedicated interface towards memory.

For efficient transfer of the memory address calculation information from the special-purpose register file to the relevant functional processor unit or units, the special register file is preferably provided with a dedicated interface towards the functional processor unit or units.

By having dedicated data paths to and/or from the special-purpose register file, memory address calculation information can be transferred in parallel with other data that are transferred to and/or from the general register file of the computer system. This results in a considerable increase of the overall system efficiency.

The special-purpose register file and its dedicated interface or interfaces do not have to use the same width as the normal registers and data paths in the system. Instead, as the address calculation information is typically wider, it is beneficial to utilize width-adapted data paths for transferring the address calculation information to avoid multi-cycle transfers.

In a preferred embodiment of the invention, the overall memory system includes a dedicated cache adapted for the memory address calculation information, and the special-purpose register file is preferably loaded directly from the dedicated cache via a dedicated interface between the cache and the special register file.

It has turned out to be advantageous to use special-purpose instructions for loading the special-purpose register file. Similarly, special-purpose instructions may also be used for performing the actual address calculations based on the address calculation information.

The invention offers the following advantages:

-   Improved general system performance;
-   Increased memory access bandwidth;
-   Efficient handling of memory address calculation information; and
-   Optimized silicon implementations.

Other advantages offered by the present invention will be appreciated upon reading the below description of the embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further objects and advantages thereof, will be best understood by reference to the following description taken together with the accompanying drawings, in which:

FIG. 1 is a schematic block diagram of a computer system in which the present invention can be implemented;

FIG. 2 is a schematic block diagram illustrating relevant parts of a computer system according to an embodiment of the invention;

FIG. 3 is a schematic block diagram illustrating relevant parts of a computer system according to another embodiment of the present invention;

FIG. 4 is a schematic block diagram illustrating parts of a computer system according to a further embodiment of the present invention;

FIG. 5 is a schematic block diagram illustrating relevant parts of a computer system according to yet another embodiment of the present invention;

FIG. 6 is a schematic principle diagram illustrating three memory reads in a prior art computer system;

FIG. 7 is a schematic principle diagram illustrating three memory reads in a computer system according to an embodiment of the present invention;

FIG. 8 is a schematic principle diagram illustrating three memory reads in a computer system according to a preferred embodiment of the present invention; and

FIG. 9 is a schematic block diagram of a VLIW-based computer system according to an exemplary embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Throughout the drawings, the same reference characters will be used for corresponding or similar elements.

FIG. 1 is a schematic block diagram of an example of a computer system in which the present invention can be implemented. The system 100 basically comprises a central processing unit (CPU) 10, a memory system 50 and a conventional input/output (I/O) unit 60. The CPU 10 comprises an optional on-chip cache 20, a register bank 30 and a processor 40. The memory system 50 may have any general design known to the art. For example, the memory system 50 may be provided with a data store as well as a program store including operating system (OS) software, instructions and references. In the present invention, the register bank 30 includes a special-purpose register file 34 referred to as an access register file 34, together with other register files such as the general register file 32 of the CPU.

The general register file 32 typically includes a conventional program counter as well as registers for holding input operands required during execution. However, the information in the general register file 32 is preferably not related to memory address calculations. Instead, such memory address calculation information is kept in the special-purpose access register file 34, which is adapted for this type of information. The memory address calculation information is generally in the form of implicit or indirect memory access information such as memory reference information, address translation information or memory mapping information.

Implicit memory access information does not directly point out a location in the memory, but rather includes information necessary for determining the memory address of some data stored in the memory. For example, implicit memory access information may be an address to a memory location, which in turn contains the address of the requested data, i.e. the effective address, or yet another address to a memory location, which in turn contains the effective address. Another example of implicit memory access information is address translation information, or memory mapping information. Address translation or memory mapping are terms for mapping a virtual memory block, or page, to the physical main memory. A virtual memory is generally used for providing fast access to recently used data or recently used portions of program code. However, in order to access the data associated with an address in the virtual memory, the virtual address must first be translated into a physical address. The physical address is then used to access the main memory.
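As a minimal sketch of the address translation step, the following Python fragment maps a virtual address to a physical address through a page table. The page size of 4096 and the page table contents are assumptions made purely for illustration; the description does not prescribe any particular page size or table layout.

```python
PAGE_SIZE = 4096  # assumed page size, not taken from the description


def translate(page_table, virtual_address):
    """Translate a virtual address into a physical address.

    `page_table` maps a virtual page number to a physical frame number
    and plays the role of the address translation information.
    """
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]  # look-up of the translation information
    return frame * PAGE_SIZE + offset


# Hypothetical mapping: virtual page 0 -> frame 7, virtual page 1 -> frame 3.
page_table = {0: 7, 1: 3}
physical = translate(page_table, 4096 + 20)  # an address in virtual page 1
```

Only after this translation has completed can the main memory be accessed at the resulting physical address.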

The processor 40 may be any processor known to the art, as long as it has processing capabilities that enable execution of instructions. In the computer system 100 according to the exemplary embodiment of FIG. 1, the processor includes one or more functional execution units 42 operable for determining memory addresses in response to memory address calculation information received from the access register file 34. The functional unit or units 42 utilize a set of operations to perform the address calculations based on the received address calculation information. The actual memory accesses may be performed by the same functional unit(s) 42 or another functional unit or set of functional units. It is thus possible to use a single functional unit to determine a memory address and to effectuate the corresponding memory access. Depending on the load sharing between functional units in the processor, it may be more beneficial to use one functional unit for determining the address and another functional unit for the actual memory access. If the address determination is more complex, it may be useful to distribute the task of determining the address among several functional units in the processor.

For efficient transfer of memory address calculation information in relation to the access register file 34, one or more dedicated data paths are used for loading the access register file 34 from memory and/or for transferring the information from the access register file 34 to the functional unit or units 42 in the processor. By having a system of dedicated data paths to and/or from the access register file 34, the memory address calculation information may be transferred in parallel with other data being transferred to and/or from the general register file 32. For example, this means that the access register file 34 may load address calculation information at the same time as the general register file 32 loads other data, thereby increasing the overall efficiency of the system.

The access register file 34 and the dedicated data paths do not have to use the same width as other data paths in the computer system. The memory address calculation information is often wider than other data transferred in the computer system, and would therefore normally require multiple operations or multi-cycle operations for loading, using conventional data paths. For this reason, the access register file and its dedicated data path or paths are preferably adapted in width to allow efficient single-cycle transfer of the information. Such adaptation normally means that a data path may transfer the necessary memory address calculation information, which may constitute several words, in a single clock cycle.
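The effect of the path width can be expressed as a simple ceiling division. The sketch below is only an illustration under the assumption that a data path moves a fixed number of words per clock cycle; the function name `transfer_cycles` is hypothetical.

```python
def transfer_cycles(info_words, path_width_words):
    """Clock cycles needed to move `info_words` words over a data path
    that carries `path_width_words` words per cycle (ceiling division)."""
    return -(-info_words // path_width_words)


conventional = transfer_cycles(2, 1)   # two-word information over a one-word path
width_adapted = transfer_cycles(2, 2)  # same information over a two-word path
```

With two-word address calculation information, the conventional path needs two cycles whereas the width-adapted path needs one.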

FIGS. 2 to 5 illustrate various embodiments according to the present invention with different possible arrangements of dedicated data paths.

In the system of FIG. 2, a dedicated data path 72 is arranged between a memory system 50 and an access register file 34. This dedicated data path 72 is used for loading memory access information from the memory system 50 to the access register file 34. By using the dedicated data path 72 for transferring memory access information, the load on the data cache 22, the bus 80 and the general register file 32 will be reduced. In addition to the dedicated data path 72, data may be transferred from the memory system 50 to the register files 32, 34 via a data cache 22 and an optional data bus 80. This cache 22 and data bus 80 primarily handle data other than memory access information, but may also transfer memory access information between the memory system 50 and the access register file 34 if desired. The information stored in the register files 32, 34 is transferred to a processor 40, preferably by using a further data bus 82. At least one dedicated functional unit 42 is arranged in the processor 40 for determining memory addresses in response to memory access information received from the access register file 34. Once a memory address is determined, the corresponding memory access (read or write) may be effectuated by the same or another functional unit in the processor. As in many modern microprocessors, the processor 40 performs write-back of execution results to the data cache 22 and/or to the register files 32, 34. As reads to the main memory are issued in the computer system, the system first goes to the cache to determine if the information is present in the cache. If the information is available in the cache, a so-called cache hit, access to the main memory is not required and the required information is taken directly from the cache. If the information is not available in the cache, a so-called cache miss, the data is fetched from the main memory into the cache, possibly overwriting other active data in the cache. Similarly, as writes to the main memory are issued, data is written to the cache and copied back to the main memory.
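The cache behaviour described above (hit, miss with fetch from main memory, and writes copied back to main memory) can be sketched in a few lines of Python. The class name `SimpleCache`, the capacity and the first-in-first-out replacement are assumptions made for illustration; the description does not specify a replacement policy.

```python
from collections import OrderedDict


class SimpleCache:
    """Toy cache in front of a dict-based main memory (illustrative only)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> cached value
        self.hits = 0
        self.misses = 0

    def _install(self, address, value):
        # Assumed policy: evict the oldest entry when the cache is full.
        if address not in self.lines and len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[address] = value

    def read(self, address):
        if address in self.lines:          # cache hit: no main memory access
            self.hits += 1
            return self.lines[address]
        self.misses += 1                   # cache miss: fetch into the cache
        value = self.main_memory[address]
        self._install(address, value)
        return value

    def write(self, address, value):
        self._install(address, value)      # data is written to the cache ...
        self.main_memory[address] = value  # ... and copied back to main memory


cache = SimpleCache({0: "a", 1: "b", 2: "c"}, capacity=2)
cache.read(0)  # miss, fetched from main memory
cache.read(0)  # hit, served from the cache
```

A miss on a full cache overwrites an existing entry, which corresponds to the overwriting of other active data mentioned above.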

FIG. 3 illustrates another possible arrangement according to the present invention. Here a dedicated data path 74 is present between the access register file 34 and at least one dedicated functional unit 42 in the processor 40. This data path 74 allows fast and efficient transfer of the memory access information from the access register file 34 to the functional unit 42. The functional unit 42 determines memory addresses in response to the memory access information and may effectuate the corresponding memory accesses. If desired, the memory access information may be transferred from the access register file 34 to the functional unit 42 through the data bus 82. Usually, however, memory access information is transferred over the dedicated path 74 in parallel with other data being transferred from the general register file 32 to the processor 40. This naturally increases the overall system efficiency. In the particular embodiment of FIG. 3, both the access register file 34 and the general register file 32 are loaded from the memory system 50 through the data cache 22 and the optional data bus 80.

FIG. 4 illustrates an embodiment based on a combination of the two dedicated data paths of FIGS. 2 and 3. Here, dedicated data paths 72, 74 for transferring memory access information are arranged both between the memory system 50 and the access register file 34 and between the access register file 34 and the functional unit(s) 42. This results in efficient transfer of memory access information from the memory system 50 to the access register file 34 as well as efficient transfer of the information from the access register file 34 to the relevant functional unit or units 42 in the processor.

As illustrated in FIG. 5, it is possible to introduce a special dedicated cache for memory access information in order to benefit from the advantages of cache memories also for this type of information, and thus reduce the overall load time. Accordingly, a dedicated cache 70 may be connected between the memory system 50 and the access register file 34 with a dedicated data path 73 directly from the cache 70 to the access register file 34. The cache 70, which is referred to as an access information cache, is preferably adapted for the memory access information such that the size of the cache words is adjusted to fit the memory access information size.

The particular design of the computer system in which the invention is implemented may vary depending on the design requirements and the architecture selected by the system designer. For example, the system does not necessarily have to use a cache such as the data cache 22. On the other hand, the overall memory hierarchy may alternatively have two or more levels of cache. Also, the actual number of functional processor units 42 in the processor 40 may vary depending on the system requirements. Under certain circumstances, a single functional unit 42 may be sufficient to perform the memory address calculations and effectuate the corresponding memory accesses based on the information from the access register file 34. However, for systems supporting dynamic linking and/or when emulating an instruction set onto another instruction set, it may be more beneficial to use several functional units 42 dedicated for memory address calculations and memory accesses, respectively. It is also possible that some of the functional units 42 may handle both memory calculations and memory accesses, possibly together with other functions.

For a better understanding of the advantages offered by the present invention, a comparison of the memory access bandwidth obtained in a prior art computer system and the memory access bandwidth obtained by using the invention will now be described with reference to FIGS. 6-8.

In the following examples, the memory access bandwidth, also referred to as fetch bandwidth, is represented by the number of clock cycles during which input ports are occupied when data is read from the memory hierarchy (including on-chip caches). It is furthermore assumed that the memory address calculation information for a single memory access comprises two words and that the data to be accessed from the determined memory address is one word. It is also assumed that the calculation of the memory address takes one clock cycle. The assumptions above are only used as examples for illustrative purposes. The actual length of the memory address calculation information and the corresponding data, as well as the number of clock cycles required for calculating a memory address may differ from system to system.

FIG. 6 illustrates three memory reads in a prior art computer system with a common data cache, but without a dedicated access register file. In such a computer system, the general register file has to handle both the memory address calculation information and other data. In a first clock cycle, a first word AI 1-1 of memory access information (AI 1: AI 1-1, AI 1-2) related to a first memory read is fetched from the data cache using the ordinary data bus. In the next clock cycle, the second word AI 1-2 of the relevant access information is fetched. Next, the corresponding memory address is determined based on the access information words. Once the memory address has been determined, a first data D1 can be read in the following clock cycle. Thus, the first memory read occupies the data cache port for three clock cycles. The total time required to read the first data D1 is of course four clock cycles. The actual address calculation, however, does not involve any reads, and this clock cycle could theoretically be used for reading data to another instruction. Similarly, the second memory read occupies the data cache port for three clock cycles, two cycles for reading the relevant access information (AI 2: AI 2-1, AI 2-2) and one cycle for reading the actual data (D2). In the same way, the third memory read occupies the data cache port for three clock cycles, two cycles for reading the relevant access information (AI 3: AI 3-1, AI 3-2) and one cycle for reading the actual data (D3).
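The cycle accounting for the prior art case can be written out as a short Python sketch. It uses the assumptions stated for these examples (two access information words, one cycle of address calculation, one data word) and additionally assumes that the shared data cache port delivers one word per clock cycle; the function name `prior_art_read` is hypothetical.

```python
def prior_art_read(ai_words=2, calc_cycles=1, data_words=1):
    """Return (port_cycles, latency) for one memory read through a
    shared data cache port that delivers one word per clock cycle."""
    port_cycles = ai_words + data_words            # cycles in which the port is busy
    latency = ai_words + calc_cycles + data_words  # cycles until the data word arrives
    return port_cycles, latency


port_cycles, latency = prior_art_read()
total_port_cycles = 3 * port_cycles  # the three memory reads of FIG. 6
```

Under these assumptions each read occupies the port for three cycles and completes after four, so the three reads together occupy the data cache port for nine clock cycles.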

FIG. 7 illustrates the same three memory reads in a computer system according to an embodiment of the invention. This computer system has a dedicated access register file for holding memory address calculation information, and preferably also a dedicated access information cache connected to the access register file. This means that access information words may be read into the access register file at the same time as data words of previous memory reads are read from the memory. Starting with the first memory read, a first word AI 1-1 and a second word AI 1-2 of the memory access information are read by the access register file. This information is forwarded to the functional unit(s) of the processor for determining the corresponding memory address. During this clock cycle of memory address calculations, a first word AI 2-1 of the memory access information related to a second memory read is read into the access register file. In the next clock cycle, the memory address of the first memory read is ready and a first data word D1 may be read. At the same time as the data word D1 is read, the access register file reads the second memory access information word AI 2-2 of the second memory read. In the next clock cycle, at the same time as the memory address of the second memory read is determined, the first word AI 3-1 of the access information of the third memory read is read into the access register file. As the second data word D2 is read, the second word AI 3-2 of the access information of the third memory read is read into the access register file. Finally, in the next clock cycle, the third data word D3 is read from the memory. It can be seen that each time the access register file reads a second word of memory access information, a data word of a previous memory read is read concurrently from the cache, which results in an increase in the effective memory access bandwidth.

FIG. 8 illustrates the same three memory reads in a computer system according to another embodiment of the invention. This computer system not only has a dedicated access register file and optional access information cache, but also data paths adapted in width for transferring the memory access information in the system. The width-adapted data paths allow all memory access information, i.e. both the first and second word, to be read in a single clock cycle. Thus, in the first clock cycle, both the first word AI 1-1 and second word AI 1-2 of memory access information are read from the access information cache into the access register file using a wide interconnect (shown as ‘high’ and ‘low’ bus branches). The second clock cycle is used for reading a first word AI 2-1 and second word AI 2-2 of the memory access information of a second memory read, as well as for determining the memory address of the first memory read. In the next clock cycle, the data word D1 of the first memory read is accessed. At the same time, the access information words AI 3-1, AI 3-2 of the third memory read are read from the access information cache to the access register file, and the memory address of the second memory read is determined. Subsequently, the data word D2 of the second memory read is accessed, and the memory address of the third memory read is determined. Finally, the data word D3 of the third memory read can be accessed. Consequently, a memory read now occupies the wider access information cache port in one clock cycle and the data cache port in another clock cycle. By pipelining memory accesses, this approach enables one memory read per clock cycle and memory port. This represents a significant increase of the effective memory access bandwidth, compared to prior art systems.
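The pipelined schedule can be reproduced with a simplified Python model. The model is an assumption made for illustration, not a description of the actual hardware: access information for successive reads arrives back to back over a dedicated port, address calculation takes the following cycle, and the data word is read in the cycle after that. The function name `data_read_cycles` and its parameters are hypothetical.

```python
def data_read_cycles(num_reads, ai_words=2, ai_path_words=2, calc_cycles=1):
    """Return the clock cycle in which each data word is read.

    Simplified model: the dedicated access information port moves
    `ai_path_words` words per cycle, and the data cache port is separate.
    """
    cycles_per_ai = -(-ai_words // ai_path_words)  # ceiling division
    schedule = []
    for k in range(1, num_reads + 1):
        ai_done = k * cycles_per_ai                # last cycle of access information transfer
        schedule.append(ai_done + calc_cycles + 1) # data read follows the calculation
    return schedule


wide = data_read_cycles(3)                     # width-adapted path, two words per cycle
narrow = data_read_cycles(3, ai_path_words=1)  # dedicated path, one word per cycle
```

With the width-adapted path the model yields data reads in cycles 3, 4 and 5, i.e. one data word per clock cycle once the pipeline is filled. With a one-word dedicated path the same model yields a data read every second cycle.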

The present invention is particularly advantageous in computer systems handling large amounts of memory address calculation information, including systems emulating another instruction set or systems supporting dynamic linking (late binding).

For example, when emulating a complex CISC instruction set on a RISC or VLIW processor, the complex CISC operations cannot be directly mapped to a corresponding RISC instruction or to an operation in a VLIW instruction. Instead, each complex memory operation is mapped into a sequence of instructions that in turn performs e.g. memory address calculations, memory mapping and checks. In conventional computer systems, the emulation of the complex memory operations generally becomes a major bottleneck.

The invention will now be described with reference to an example of a VLIW-based implementation suitable for emulating a complex CISC instruction set. In general, VLIW-based processors try to exploit instruction-level parallelism, and the main objective is to eliminate the complex hardware-implemented instruction scheduling and parallel dispatch used in modern superscalar processors. In the VLIW approach, scheduling and parallel dispatch are performed by using a special compiler, which parallelizes instructions at compilation of the program code.

FIG. 9 is a schematic block diagram of a VLIW-based computer system according to an exemplary embodiment of the present invention. The exemplary computer system basically comprises a VLIW-based CPU 10 and a memory system 50. In this particular embodiment, the VLIW-based CPU 10 is built around a six-stage pipeline: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Cache Access and Write-Back. The pipeline includes an instruction fetch unit 90, an instruction decode unit 92 together with additional functional execution and branch units 42-1, 42-2, 44-1, 44-2 and 46. The CPU 10 also comprises a conventional data cache 22 and a general register file 32. The system is primarily characterized by an access information cache 70, an access register file 34 and functional access units 42-1, 42-2 interconnected by dedicated data paths. The access information cache 70 and the access register file 34 are preferably dedicated to hold only memory access information and thus normally adapted to the access information size. By using separate data paths adapted in width to memory access information, it is possible to transfer memory access information that is wider than other normal data without introducing multi-cycle transfers.

In operation, the instruction fetch unit 90 fetches a VLIW word, normally containing several primitive instructions, from the memory system 50. In addition to normally occurring instructions, the VLIW instructions preferably also include special-purpose instructions adapted for the present invention, such as instructions for loading the access register file 34 and for determining memory addresses based on memory access information stored in the access register file 34. The fetched instructions, whether general or special, are decoded in the instruction decode unit 92. Operands to be used during execution are typically fetched from the register files 32, 34, or taken as immediate values 88 derived from the decoded instruction words. Operands concerning memory address determination and memory accesses are found in the access register file 34 and other general operands are found in the general register file 32. Functional execution units 42-1, 42-2; 44-1, 44-2 execute the VLIW instructions more or less in parallel. In this particular example, there are functional access units 42-1, 42-2 for determining memory addresses and effectuating the corresponding memory accesses by executing the decoded special instructions. Preferably, the ALU units 44-1, 44-2 execute special-purpose instructions for reading access information from the access information cache 70 into the access register file 34. The reason for letting the ALU units execute these read instructions is typically that a better instruction load distribution among the functional execution units of the VLIW processor is obtained. The instructions for reading access information to the access register file 34 could equally well be executed by the access units 42-1, 42-2. Execution results can be written back to the data cache 22 (and copied back to the memory system 50) using a write-back bus. Execution results can also be written back to the access information cache 70, or to the register files 32, 34 using the write-back bus.

In order to streamline the transfer of data in the computer system of FIG. 9, forwarding paths 76, 84, 86 may be introduced. This is useful when the instructions for handling the memory access information are similar to the basic instructions for integers and floating points, i.e. load instructions for loading data to the access register file 34 and register-register instructions for processing the memory access information. A forwarding path 84 may be arranged from the write-back bus to operand bus 82 leading to the functional units 42-1, 42-2, 44-1, 44-2, 46. Such a forwarding path 84 makes it possible to use the output from one register-register instruction directly in the next register-register instruction without passing the data via the register files 32, 34. In addition, a forwarding path 86 may be arranged from the general data cache 22 to the operand bus 82 and the functional units 42-1, 42-2; 44-1, 44-2. With such an arrangement the one clock cycle penalty of writing the data to the general register file 32 and reading it therefrom in the next clock cycle is avoided. In a similar way, a wider forwarding path 76 may be arranged for forwarding access information directly from the dedicated cache 70 to the dedicated functional units 42-1, 42-2.
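
By way of illustration only, the effect of a forwarding path from the write-back bus to the operand bus may be sketched in Python as follows. The class Pipeline and the function cycles_for_dependent_chain are hypothetical names introduced for this sketch and do not appear in the described hardware. The sketch assumes that a result placed on the write-back bus in one cycle is committed to the register file at the end of the following cycle, so that a dependent instruction stalls for one cycle unless the value is forwarded.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy model of operand fetch with an optional forwarding path (assumed timing)."""
    forwarding: bool
    regs: dict = field(default_factory=dict)       # committed register file contents
    writeback: dict = field(default_factory=dict)  # results currently on the write-back bus

    def read_operand(self, name):
        # With forwarding, a value on the write-back bus is used directly.
        if self.forwarding and name in self.writeback:
            return self.writeback[name]
        # None means the value is not yet in the register file: the reader must stall.
        return self.regs.get(name)

    def end_cycle(self, new_results):
        # Values on the bus are committed; the new results take their place on the bus.
        self.regs.update(self.writeback)
        self.writeback = dict(new_results)


def cycles_for_dependent_chain(forwarding, length):
    """Count cycles needed to issue `length` register-register instructions,
    each depending on the result of the previous one."""
    pipe = Pipeline(forwarding)
    pipe.regs["r0"] = 1
    cycles = 0
    issued = 0
    while issued < length:
        cycles += 1
        operand = pipe.read_operand(f"r{issued}")
        if operand is None:
            pipe.end_cycle({})  # stall cycle: nothing new is produced
            continue
        pipe.end_cycle({f"r{issued + 1}": operand + 1})
        issued += 1
    return cycles
```

Under these assumptions, a chain of three dependent instructions issues in three cycles with forwarding and in five cycles without it, i.e. one stall cycle per dependency.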

For a more thorough understanding of the operation and performance of the VLIW-based computer system of FIG. 9, a translation of an exemplary sequence of ASA instructions into primitive instructions (primitives), and scheduling of the primitives for parallel execution by the VLIW processor will now be described. The example is related to the APZ machine from Ericsson.

Table I below lists an exemplary sequence of ASA instructions. The instruction set supports dynamic linking. A logical variable is read from a logical data store using a RS (read store) instruction that implicitly accesses linking information and calculates the physical address in memory.

TABLE I

Execution cycle  Instruction            Comment
00580032         RSA DR0- 3;            : read logical variable no 3 to register DR0
00580033         JEC DR0, 1, %L%392C;   : jump if DR0 equal to 1 to label
00580048         RSU DR0- 75;           : load unsigned logical variable 75 to DR0
00580048         LHC CR/W0-20;          : load 20 to register CR/W0
00580049         JER DR0, %L%3938;      : jump if register is 0 to label
00580049         LHC GR/W0-21;          : load 21 to register GR/W0
00580050         JER DR0, %L%393E;      : jump if register is 0 to label
00580065         RSU DR0- 159;          : read logical variable no 159 to register DR0
00580066         JUC DR0, 1, %L%3A6C;   : jump if DR0 unequal to 1 to label
00580067         WZU 11;                : write zero to logical variable no 11
00580068         WZU 12;                : write zero to logical variable no 12
00580079         RSA DR0- 1;            : read logical variable no 1 to register DR0
00580080         JUC DR0, 2, %L%40E9;   : jump if DR0 unequal to 2 to label
00580081         WHCU 1- 3;             : write 3 to logical variable no 1
00580082         JLN %L%40E9;           : jump to label
00580082         MFR PR0- WR18;         : move from register to register
00580083         WZU 11;                : write zero to logical variable no 11
00580084         WZU 12;                : write zero to logical variable no 12
00580084         LCC DR0- 0;            : load 0 to register DR0
00580085         WSSU 71/B7-DR0;        : write bit7 in var71 with value in DR0.
00580115         RSA DR0- 1;            : read logical variable no 1 to register DR0
00580116         JUC DR0, 1, %L%40F6;   : jump if DR0 unequal to 1 to label
00580117         WZL 426;               : write zero to logical variable no 426
00580117         RSA DR0- 28;           : read logical variable no 3 to register DR0
00580118         LWCD CR/D0-65535;      : load 65535 to register CR/D0
00580119         JUR DR0, %L%4105;      : jump if DR0 equal to 0 to label
00580120         WSA 28-WR18;           : write register contents to logical variable 28
00580121         WSU 82-WR18;           : write register contents to logical variable 82
00580122         WOU 63;                : write all ones to logical variable 63
00580123         WHCU 29-1;             : write 1 to logical variable no 29
00580124         JLN %L%410F;           : jump to label

As illustrated in Table II below, the ASA sequence may be translated into primitives for execution on the VLIW-based computer system. In an exemplary embodiment of the invention, APZ registers such as PR0, DRx, WRx and CR/W0 are mapped to VLIW general registers, denoted grxxx below. The VLIW processor generally has many more registers, and therefore, the translation also includes register renaming to handle anti-dependencies, for example as described in Computer Architecture: A Quantitative Approach by J. L. Hennessy and D. A. Patterson, second edition 1996, pp. 210-240, Morgan Kaufmann Publishers, California. The compiler performs register renaming and, in this example, each write to an APZ register assigns a new grxx register in the VLIW architecture. Registers in the access register file, denoted arxxx below, are used for address calculations performing dynamic linking that are implicit in the original assembler code. A read store, RSA in the assembler code above, is mapped to a sequence of instructions: LBD (load linkage information), ACVLN (address calculation variable length), ACP (address calculation pointer), ACI (address calculation index), and LD (load data). The example assigns a new register in the ARF for each step in the calculation when it is updated. A write store performs the same sequence for the address calculation and then the last primitive is an SD (store data) instead of LD (load data).

The memory access information is loaded into the access register file 34 by a special-purpose instruction LBD. The LBD instruction uses a register in the access register file 34 as target register instead of a register in the general register file 32. The information in the access register file 34 is transferred via a dedicated wide data path, including a wide data bus 74, to the functional access units 42-1, 42-2. These functional units 42-1, 42-2 perform the memory address calculation in steps by using special instructions ACP and ACVLN, and finally effectuate the corresponding memory accesses by using a load instruction LD or a store instruction SD.
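
By way of illustration only, the step-wise sequence LBD, ACP, LD may be sketched in Python as follows. The field names v and q and the scaled pointer term pointer*2^(v+q) are taken from the comments of Table II. Everything else is an assumption made for the sketch: the base address field addr, its addition in the ACP step, the table contents, and the function names lbd, acp, ld and read_store.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AccessInfo:
    """Contents of one access register (layout assumed for this sketch)."""
    addr: int  # hypothetical base address field
    v: int     # v and q are named in Table II; their exact meaning is assumed here
    q: int


# Hypothetical access information cache, indexed by logical variable number.
ACCESS_INFO_CACHE = {3: AccessInfo(addr=0x1000, v=1, q=0)}

# Hypothetical data memory holding one value at the address computed below.
DATA_MEMORY = {0x1000 + 5 * 2: 42}


def lbd(var_no):
    """LBD: load access information for a logical variable into an access register."""
    return ACCESS_INFO_CACHE[var_no]


def acp(ar, pointer):
    """ACP: add the scaled pointer term pointer * 2**(v + q) to the address (assumed)."""
    return replace(ar, addr=ar.addr + pointer * 2 ** (ar.v + ar.q))


def ld(ar):
    """LD: load data from the address held in the access register."""
    return DATA_MEMORY[ar.addr]


def read_store(var_no, pointer):
    """A read store expanded into primitives, one new access register per step."""
    ar_a = lbd(var_no)
    ar_b = acp(ar_a, pointer)
    return ld(ar_b)
```

With these assumed table contents, read_store(3, 5) computes the address 0x1000 + 5*2 and returns the value stored there. A write store would follow the same steps and end with a store instead of a load.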

Redundant primitives are revealed when complex instructions are broken up into primitives, and are normally removed. When the address calculation is made explicit in this way, it is possible for the code optimizer to remove unnecessary steps; for example, ACI and ACP are only needed for one- and two-dimensional array variables, and ACVLN is not needed for normal 16-bit variables. Also, it is not necessary to redo the address calculations, or parts of them, when having multiple accesses to the same variable.

TABLE II

. . .
ACP ar99, PR0 -> ar104     : ar104.addr:=pr0*2^(ar99.v + ar99.q)
SD gr50 -> (ar104)         : store data in gr50 in ar104 address
ACP ar101, PR0 -> ar105    : calc. addr. from values in ar101
SD gr50 -> (ar105)         : store data in gr50 at ar105.addr
ACVLN ar73, B7 -> ar106    : calc. (add) var. length part of addr.
SD DR0 -> (ar106)          : store DR0 value at resulting address
ACP ar2, PR0 -> ar107      : calculate pointer part of var. address
LD (ar107) -> DR0          : load register DR0 from addr. in ar107.
CMPC DR0,#1 -> p28         : compare equality with constant
LIR #1 -> gr71             : load immediate to register
LBD 42 -> ar108            : load addr. calc data (v & q) into ar108
ACP ar108,PR0 -> ar109     : calculate pointer part of address
SEL p28,gr71,gr50 -> gr72  : select depending on pxx
SD gr72 -> (ar109)         : store data in gr72 at address in ar109
LBD 28 -> ar110            : load address calc. data into ar110
LD (ar110) -> DR0          : load data from address in ar110
CMPC DR0,#65535 -> p29     : compare equality with constant
CJMPI p29, ...             : conditional jump if pxx not true
SD WR18 -> (ar110)         : store data
LBD 82 -> ar111            : load address data from table (idx: 82)
SD WR18 -> (ar111)         : store data
LBD_(c) 63 -> ar112        : load address data, table index is 63
ACP ar112,PR0 -> ar113     : calculate address with pointer in PR0
LIR #−1 -> gr76            : load immediate to register
SD gr76 -> (ar113)         : store data from gr76 to ar113.addr
LBD 29 -> ar114            : load address data (ar114.v, ar114.q).
LIR #1 -> gr78             : load immediate to register
SD gr78 -> (ar114)         : store data in gr78 at ar114.addr
JL . . .                   : jump to label
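
By way of illustration only, the removal of a repeated address calculation for the same variable may be sketched in Python as follows. The function remove_redundant_loads and the tuple representation (opcode, source, destination) are hypothetical and are not taken from the described compiler. The sketch further assumes that the access information of a variable is not modified between the two loads, and it handles only repeated LBD primitives.

```python
def remove_redundant_loads(primitives):
    """Drop repeated LBD primitives for the same logical variable.

    primitives: list of (opcode, source, destination) tuples.
    Later uses of a dropped destination register are renamed to the access
    register that was loaded first for that variable.
    """
    first_reg = {}  # logical variable number -> access register holding its info
    rename = {}     # dropped access register -> surviving access register
    result = []
    for opcode, source, destination in primitives:
        source = rename.get(source, source)
        destination = rename.get(destination, destination)
        if opcode == "LBD":
            if source in first_reg:
                # Same variable loaded again: reuse the earlier access register.
                rename[destination] = first_reg[source]
                continue
            first_reg[source] = destination
        result.append((opcode, source, destination))
    return result


# Hypothetical unoptimized sequence: variable 28 is read and then written.
unoptimized = [
    ("LBD", 28, "ar110"),
    ("LD", "ar110", "DR0"),
    ("LBD", 28, "ar120"),
    ("SD", "WR18", "ar120"),
]
optimized = remove_redundant_loads(unoptimized)
```

In this hypothetical sequence the second LBD is dropped and the SD is redirected to ar110, which corresponds to the way a single LBD 28 -> ar110 serves both the LD and the SD in Table II.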

These primitives can be scheduled for parallel execution on the VLIW system of FIG. 9 as illustrated in Table III below:

TABLE III

Access Unit 1          Access Unit 2         ALU 1                    ALU 2           Branch Unit
ACP ar99, PR0->ar104   LD (ar107)->DR0       LBD 28->ar110            LBD 426->ar108
SD gr50->(ar104)       ACVLN ar73,B7->ar106  LBD 63->ar112            LBD 82->ar111
ACP ar101, PR0->ar105  SD DR0->(ar106)       CMPC DR0,#1->p28         LIR #1->gr71
LD (ar110)->DR0        ACP ar108,PR0->ar109  SEL p28,gr71,gr50->gr72  LBD 29->ar114
SD gr50->(ar105)       SD gr72->(ar109)      LBD 434->ar115           LIR #−1->gr76
ACP ar112,PR0->ar113                         CMPC DR0,#65535->p29     LIR #1->gr78
SD p29,WR18->(ar110)   SD p29,WR18->(ar111)                                           CJMPI p29, . . .
SD gr76->(ar113)       SD gr78->(ar114)                                               JL . . .

The example above assumes a two-cycle load-use latency (one delay slot) for accesses both from the access information cache and from the data cache, and can thus be executed in eight clock cycles if there are no cache misses.

The advantage of the invention is apparent from the first line of code (in Table III), which includes three separate loads, two from the access information cache 70 and one from the data cache 22. The memory access information is two words long in the example, which means that 5 words of information are loaded in one clock cycle. In the prior art, this would normally require 3 clock cycles, even when implementing a dual-ported cache.
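
The word count above may be checked with a short Python sketch. The two-word size of the access information and the three loads of the first line are taken from the example. The assumption that each port of a dual-ported cache delivers one word per clock cycle is introduced here only as one plausible reading of the prior-art figure, and the function cycles_needed is a hypothetical name.

```python
import math

ACCESS_INFO_WORDS = 2  # access information is two words long in the example
DATA_WORDS = 1         # an ordinary data load fetches one word

# The three loads in the first line of Table III: one LD and two LBD.
first_line_loads = ["LD", "LBD", "LBD"]

words_loaded = sum(
    ACCESS_INFO_WORDS if opcode == "LBD" else DATA_WORDS
    for opcode in first_line_loads
)


def cycles_needed(words, ports, words_per_port=1):
    """Cycles needed to load `words` words through `ports` ports (assumed port width)."""
    return math.ceil(words / (ports * words_per_port))


prior_art_cycles = cycles_needed(words_loaded, ports=2)
```

Under these assumptions, words_loaded evaluates to 5 and prior_art_cycles to 3, in agreement with the figures stated above.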

It can be noted that separate “address registers” or “segment registers” are used in many older processor architectures such as Intel IA32 (x86 processor), IBM Power and HP PA-RISC. However, these registers are usually used for holding an address extension that is concatenated with an offset for generating an address that is wider than the word length of the processor (for example generating a 24 bit or 32 bit address on a 16 bit processor). These address registers are not related to step-wise memory address calculations, nor supported by a separate cache and dedicated load path.
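
For comparison, the concatenation of an address extension with an offset may be sketched in Python as follows. The function concat_address and the 8-bit extension used in the example are assumptions made for illustration; the sketch shows plain concatenation as described above and does not reproduce the exact scheme of any particular architecture.

```python
def concat_address(extension, offset, offset_bits=16):
    """Concatenate an address extension with an offset of `offset_bits` bits."""
    if not 0 <= offset < (1 << offset_bits):
        raise ValueError("offset does not fit in offset_bits")
    return (extension << offset_bits) | offset


# An assumed 8-bit extension and a 16-bit offset give a 24-bit address.
wide_address = concat_address(0x12, 0x3456)
```

Here wide_address equals 0x123456. The address is obtained in a single step from a register and an offset, in contrast to the step-wise calculation performed on the access register file.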

In the article HP, Intel Complete IA64 Rollout, by K. Diefendorff, Microprocessor Report, Apr. 10, 2000, a VLIW architecture with separate “region registers” is proposed. These registers are not directly loaded from memory and there are no special instructions for address calculations. The registers are simply used by the address calculation hardware as part of the execution of memory access instructions.

The VLIW-based computer system of FIG. 9 is merely an example of a possible computer system suitable for emulation of a CISC instruction set. The actual implementation may differ from application to application. For example, additional register files such as a floating point register file and/or graphics/multimedia register files may be employed. Likewise, the number of functional execution units may be varied within the scope of the invention. It is also possible to realize a corresponding implementation on a RISC computer.

The invention is particularly useful in systems using dynamic linking, where the memory addresses of instructions and/or variables are determined in several steps based on indirect or implicit memory access information. In systems with dynamically linked code that can be reconfigured during operation, the memory addresses are generally determined by means of look-ups in different tables. The initial memory address information itself does not directly point to the instruction or variable of interest, but rather contains a pointer to a look-up table or similar memory structure, which may hold the target address. If several table look-ups are required, a lot of memory address calculation information must be read and processed before the target address can be retrieved and the corresponding data accessed. By implementing any combination of a dedicated access information cache, a dedicated access register file and functional units adapted to perform the necessary table look-ups and memory address calculations, the memory access bandwidth and overall performance of computer systems using dynamic linking will be significantly improved.
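
By way of illustration only, a chain of table look-ups of the kind described above may be sketched in Python as follows. The function resolve, the flat dictionary used as memory and the example addresses are all hypothetical. The sketch merely counts how many memory reads precede the access to the target.

```python
def resolve(memory, initial_pointer, levels):
    """Follow `levels` table look-ups starting from `initial_pointer`.

    Returns the target address and the number of memory reads performed.
    """
    address = initial_pointer
    reads = 0
    for _ in range(levels):
        address = memory[address]  # each table entry holds a pointer to the next table
        reads += 1
    return address, reads


# Hypothetical memory image: three linked tables ending in a target address.
memory = {0x10: 0x20, 0x20: 0x30, 0x30: 0x4000}
target, reads = resolve(memory, 0x10, levels=3)
```

In this example three reads of address calculation information are needed before the target address 0x4000 is known, which is the kind of traffic that a dedicated access information cache and access register file are intended to serve.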

Although the improvement in performance obtained by using the invention is particularly apparent in applications involving emulation of another instruction set and dynamic linking, it should be understood that the computer design proposed by the invention is generally applicable.

The clock frequency of any chip implemented in deep sub-micron technology (0.15 μm or smaller) is limited by the delays in the interconnecting paths. Interconnect delays are minimized with a small number of memory loads and by keeping wiring short. The use of a dedicated access register file and a dedicated access information cache makes it possible to target both ways of minimizing the delays. The access register file with its dedicated load path gives a minimal number of memory loads. If used, the access information cache can be co-located with the access register file on the chip, thus reducing the required wiring distance. This is quite important since modern microprocessors have the most timing critical paths in connection with first level cache accesses.

The embodiments described above are merely given as examples, and it should be understood that the present invention is not limited thereto. Further modifications, changes and improvements which retain the basic underlying principles disclosed and claimed herein are within the scope and spirit of the invention.

1. A computer system comprising: a special-purpose register file adapted for holding memory address calculation information received from memory, said special-purpose register file having at least one dedicated interface for allowing efficient transfer of memory address calculation information in relation to said special-purpose register file; means for determining a memory address in response to memory address calculation information received from said special-purpose register file, thus enabling a corresponding memory access.
2. The computer system according to claim 1, further comprising means for effectuating a memory access based on the determined memory address.
3. The computer system according to claim 1, wherein said at least one dedicated interface comprises a dedicated interface between said special-purpose register file and memory.
4. The computer system according to claim 1, wherein said at least one dedicated interface comprises a dedicated interface between said special-purpose register file and said means for determining a memory address.
5. The computer system according to claim 1, wherein said at least one dedicated interface includes a dedicated data path adapted in width to said memory address calculation information.
6. The computer system according to claim 1, wherein said memory comprises a dedicated cache adapted for said memory address calculation information.
7. The computer system according to claim 1, wherein said means for determining a memory address comprises at least one functional processor unit.
8. The computer system according to claim 7, wherein a forwarding data path is arranged from an output bus associated with said at least one functional processor unit to an input bus associated with said at least one functional processor unit.
9. The computer system according to claim 1, wherein said means for determining a memory address is operable for executing special-purpose instructions in order to determine said memory address.
10. The computer system according to claim 1, further comprising means for executing special-purpose load instructions in order to load said memory address calculation information from said memory to said special-purpose register file.
11. The computer system according to claim 10, wherein said means for executing special-purpose load instructions comprises at least one functional processor unit.
12. The computer system according to claim 11, wherein a forwarding data path is arranged from said memory to an input wherein said memory address calculation information is in the form of implicit memory access information.
14. The computer system according to claim 13, wherein said implicit memory access information includes memory address translation information.
15. A computer system comprising: a dedicated cache adapted for holding memory access information; a special-purpose register file adapted for holding memory access information received from said dedicated cache over a first dedicated interface; means for determining a memory address in response to memory access information received from said special-purpose register file over a second dedicated interface; and means for effectuating a corresponding memory access based on the determined memory address.
16. The computer system according to claim 15, wherein said first and second dedicated interfaces are adapted in width to said memory address calculation information.
17. A method of handling memory address calculation information, said method comprising the steps of: holding memory address calculation information received from memory, in a special-purpose register file; transferring memory address calculation information in relation to said special-purpose register file via at least one dedicated interface associated with said special-purpose register file; and determining a memory address in response to memory address calculation information received from said special-purpose register file, thus enabling a corresponding memory access.
18. The method according to claim 17, further comprising the step of effectuating a memory access based on the determined memory address.
19. The method according to claim 17, wherein said at least one dedicated interface comprises a dedicated interface between said special-purpose register file and memory.
20. The method according to claim 17, wherein said at least one dedicated interface comprises a dedicated interface between said special-purpose register file and a means for determining a memory address.
21. The method according to claim 17, further comprising the step of adapting a dedicated data path in width to said memory address calculation information.
22. The method according to claim 17, further comprising the step of utilizing a dedicated cache adapted for said memory address calculation information.